Acknowledgements

  • The original version of this material stems from Mette Langaas who has put a lot of effort in creating this module. Thanks to Mette for the permission to use the material!

Learning material for this module

  • James et al. (2021): An Introduction to Statistical Learning. Chapter 10.

  • All the material presented on these module slides and in class.

  • Videos on neural networks and backpropagation.

\(~\)

Secondary material (not compulsory):

\(~\)

See also References and further reading (last slide), for further reading material.


What will you learn?


\(~\)

  • Deep learning: The timeline

  • Single and multilayer neural networks

  • Convolutional neural networks

  • Recurrent neural networks

  • Interpolation and double descent

\(~\)

  • Neural network parts: model – method – algorithm – recent developents

Introduction

  • Neural networks (NN) first rose to fame in the late 1980s and early 1990s.

  • Being highly parameterized models, they caused a shift of interest from statistics toward computer science and machine learning.

  • Statisticians were skeptical: “It’s just a nonlinear model.”

  • After the first hype, NNs were pushed aside by boosting and support vector machines.

  • Revival since 2010: the emergence of deep learning, driven by improved computing resources, some algorithmic innovations, and applications to image and video classification and to speech and text processing.


Deep learning

Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is.

Geoffrey Hinton (Google)


Timeline

(based on Chollet and Allaire (2018))

\(~\)

  • Neural networks were investigated in “toy form” in the 1950s.
  • The first big step was taken in the 1980s when the backpropagation algorithm was rediscovered.
  • In 1989, Yann LeCun (Bell Labs) used convolutional neural networks to classify handwritten digits.
  • Not so much activity in the neural network field in the 2000s.

  • In 2011 neural networks with many layers (and trained with GPUs) were performing well on image classification tasks.
  • The ImageNet classification challenge (classify high-resolution colour images into 1,000 categories after training on 1.4 million images) was won by solutions with deep convolutional neural networks (convnets). In 2011 the accuracy was 74.3%, in 2012 83.6%, and in 2015 96.4%.
  • Since 2012, convnets have been the go-to solution for computer vision tasks; other application areas include natural language processing.

Deep?

\(~\)

  • Deep learning does not mean a deeper understanding, but refers to successive layers of representations, where the number of layers gives the depth of the model. Often tens to hundreds of layers are used.
  • Deep neural networks are not seen as models of the brain, and are not related to neurobiology.
  • A deep network can be seen as many stages of information distillation, where each stage performs a simple data transformation. These transformations are not curated by the data analyst, but are estimated in the network.
  • In contrast, in statistics we first select a set of inputs, then look at how these inputs should be transformed, before we apply some statistical methods (feature engineering).

  • The success of deep learning is dependent upon breakthroughs in
    • hardware development, especially faster CPUs and massively parallel graphical processing units (GPUs),
    • datasets and benchmarks (internet/tech data),
    • improvements of the algorithms.

\(~\)

  • Achievements of deep learning include high-quality (near-human to super-human) image classification, speech recognition, handwriting transcription, machine translation, digital assistants, autonomous driving, ad targeting, web search, and playing Go and chess.

Why a module on neural networks?

\(~\)

  • Every day you read about the success of AI, machine learning – and in particular deep learning.

\(~\)

  • In the last five years the field of deep learning has gone from poor performance to excellent performance – particularly in image recognition and speech transcription.

\(~\)

  • Deep learning is based on a layered artificial neural network structure.

\(~\)

So we first need to understand: what is a neural network?


Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals. Image credits: By Egm4313.s12 (Prof. Loc Vu-Quoc) https://commons.wikimedia.org/w/index.php?curid=72816083


  • There are several (self-study) learning resources (some listed under ‘further references’) that students may turn to for deeper knowledge of deep learning, but this presentation is heavily based on Chollet and Allaire (2018), with added formulas and theory.
  • There is a new IT3030 deep learning course at NTNU.


AI, machine learning and statistics


  • Artificial intelligence (AI) dates back to the 1950s, and can be seen as the effort to automate intellectual tasks normally performed by humans (page 4, Chollet and Allaire (2018)).

\(~\)

  • AI was first based on hardcoded rules (like in chess programs), but turned out to be intractable for solving more complex, fuzzy problems.

\(~\)


Machine learning

\(~\)

  • With machine learning, the shift is that a system is trained rather than explicitly programmed.

\(~\)

  • Machine learning is related to mathematical statistics, but differs in many ways.

\(~\)

  • ML deals with much larger and more complex data sets than what is usually done in statistics.

\(~\)

  • The focus in ML is oriented towards engineering, and ideas are proven empirically rather than theoretically (which is the case in mathematical statistics).

\(~\)

According to Chollet and Allaire (2018) (page 19):

Machine learning isn’t mathematics or physics, where major advancements can be done with a pen and a piece of paper. It’s an engineering science.


Deep learning

\(~\)

  • Deep does not refer to a deeper understanding.

\(~\)

  • Rather, deep refers to the layers of representation, for example in a neural network.

\(~\)


From statistics to artificial neural networks

Recapitulate from Module 3 with the bodyfat dataset that contained the following variables.

  • bodyfat: % of body fat.
  • age: age of the person.
  • weight: body weight.
  • height: body height.
  • neck: neck thickness.
  • bmi: bmi.
  • abdomen: circumference of abdomen.
  • hip: circumference of hip.

We will now look at modelling the bodyfat as response and using all other variables as covariates - this will give us

  • seven covariates
  • one intercept
  • one numerical output

Let \(n\) be the number of observations in the training set, here \(n=243\).


Multiple linear regression model

(from Module 3)

\(~\)

We assume \[ Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i={\boldsymbol x}_i^T{\boldsymbol\beta}+\varepsilon_i \ , \]

for \(i=1,\ldots,n\), where \(x_{ij}\) is the value of the \(j\)th predictor for the \(i\)th data point, and \({\boldsymbol\beta}^\top = (\beta_0,\beta_1,\ldots,\beta_p)\) are the regression coefficients.

\(~\)

We used the compact matrix notation for all observations \(i=1,\ldots,n\) together: \[{\boldsymbol Y}={\boldsymbol {X}} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}} \ .\]


Assumptions:

  1. \(\mathsf{E}(\boldsymbol{\varepsilon})=\boldsymbol{0}\).
  2. \(\mathsf{Cov}(\boldsymbol{\varepsilon})=\mathsf{E}(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T)=\sigma^2\boldsymbol{I}\).
  3. The design matrix has full rank, \(\text{rank}({\bf X})=p+1\). (We assume \(n \gg p+1\).)

The classical normal linear regression model is obtained if additionally

  1. \(\boldsymbol\varepsilon\sim N_n({\boldsymbol 0},\sigma^2 {\bf I})\) holds. Here \(N_n\) denotes the \(n\)-dimensional multivariate normal distribution.

From statistical model to network architecture

\(~\)

How can our statistical model be represented as a network?

\(~\)

We need new concepts:

\(~\)

  • Covariates \(\rightarrow\) input nodes in an input layer.

\(~\)

  • The intercept \(\rightarrow\) bias node.

\(~\)

  • The response \(\rightarrow\) output node in an output layer.

\(~\)

  • The regression coefficients \(\rightarrow\) weights (often written on the arrows from the inputs to the output layer).


\(~\)

  • All lines going into the output node signify that we multiply the covariate values in the input nodes by the weights (regression coefficients), and then sum.
  • This sum can be sent through a so-called activation function (here just the identity function).

  • Regression notation:

\[\begin{equation*} Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i \ . \end{equation*}\]

\(~\)

  • Neural network notation: \[\begin{equation*} y_1({\boldsymbol x}_i)=\phi_o(w_0+w_1 x_{i1}+\cdots + w_p x_{ip}) \ , \end{equation*}\] where \(\phi_o(x)=x\) (identity function).

\(~\)

In the statistics world

  • we would have written \(\hat{y}_1({\boldsymbol x}_i)\) to specify that we are estimating a predicted value of the response for the given covariate value.

  • we would have called the \(w\)s \(\hat{\beta}\)s instead.


Statistical parameter estimation

\(~\)

Remember: The estimator \(\hat{\boldsymbol \beta}\) is found by minimizing the RSS for a multiple linear regression model: \[ \begin{aligned} \text{RSS} &=\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} -...-\hat \beta_p x_{ip} )^2 \\ &= \sum_{i=1}^n (y_i-{\boldsymbol x}_i^T \hat{\boldsymbol \beta})^2=({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}})^T({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}}) \ .\end{aligned} \] Solution: \[ \hat{\boldsymbol\beta}=({\boldsymbol X}^T{\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol Y} \ .\]
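
The closed-form solution can be checked numerically. A minimal sketch with simulated data (Python is used here for illustration only; the data and names are made up, not the bodyfat example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
# design matrix: first column of ones = intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# least-squares estimate via the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With small noise and \(n \gg p+1\), `beta_hat` lands very close to `beta_true`.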


Neural networks: loss function and gradient descent

We now translate from the statistical into the neural networks world:

  1. Replace the parameters \(\boldsymbol\beta\) with network weights \(\boldsymbol w\).

\(~\)

  2. Replace the RSS in our training data set with the following loss function (mean squared error) \[ J({\boldsymbol w})=\frac{1}{n}\sum_{i=1}^n (y_i-{\hat{y}_1({\boldsymbol x}_i)})^2 \ ,\] where \(J({\boldsymbol w})\) indicates that the unknown parameters are the weights \({\boldsymbol w}\).

\(~\)

  3. Replace minimizing the loss function (RSS) by more general minimization procedures that also work when the loss function does not have a closed-form minimizer. In particular, use the gradient descent algorithm.

Gradient descent and local minima

\(~\)

\(~\)

(https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent)

\(~\)


Finding optimal weights: Gradient descent algorithm

  1. Let \(t=0\) and denote the given initial values for the weights by \({\boldsymbol w}^{(t)}\).
  2. Until finding a (local) optimum, repeat a) to e)
    a. Calculate the predictions \({\hat{y}_1({\boldsymbol x}_i)}\).
    b. Calculate the loss function \(J({\boldsymbol w}^{(t)})\).
    c. Find the gradient (direction) in the \((p+1)\)-dimensional space of the weights, evaluated at the current weight values: \(\nabla J({\boldsymbol w}^{(t)})={\frac{\partial J}{\partial {\boldsymbol w}}}({\boldsymbol w}^{(t)})\).
    d. Go a given step length (learning rate) \(\lambda\) in the direction of the negative gradient of the loss function to get \[{\boldsymbol w}^{(t+1)}={\boldsymbol w}^{(t)} - \lambda \nabla J({\boldsymbol w}^{(t)}) \ .\]
    e. Set \(t=t+1\).
  3. The final values of the weights in that \((p+1)\)-dimensional space are our parameter estimates, and the network is trained.
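
The steps above can be sketched directly for the mean-squared-error loss of a linear model (an illustrative Python sketch with simulated data; the variable names are not from the course examples):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 2
# first column of ones plays the role of the bias node
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

w = np.zeros(p + 1)  # initial weights w^(0)
lam = 0.1            # learning rate (step length)
for t in range(500):
    y_hat = X @ w                        # a) predictions
    J = np.mean((y - y_hat) ** 2)        # b) loss function J(w^(t))
    grad = -2.0 / n * X.T @ (y - y_hat)  # c) gradient of J at w^(t)
    w = w - lam * grad                   # d) step in the negative gradient direction
                                         # e) t = t + 1 is the loop itself
```

After enough iterations the weights are essentially the least-squares solution, so here they recover `w_true` up to noise.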

Backpropagation

  • In neural networks the gradient part of the gradient descent algorithm is implemented efficiently in an algorithm called backpropagation (see later).

Here we compare

  • the MLR solution with lm
  • the neural network solution with nnet

Example 1: Continuous outcome (linear regression)

\(~\)

Linear regression vs. neural networks: an example.

\(~\)

fit = lm(bodyfat ~ age + weight + height + bmi + neck + abdomen + hip,
    data = d.bodyfat)
fitnnet = nnet(bodyfat ~ age + weight + height + bmi + neck + abdomen +
    hip, data = d.bodyfat, linout = TRUE, size = 0, skip = TRUE, maxit = 1000,
    entropy = FALSE)
## # weights:  8
## initial  value 114935.463194 
## iter  10 value 4471.423775
## final  value 4415.453729 
## converged
cbind(fitnnet$wts, fit$coefficients)
##                      [,1]          [,2]
## (Intercept) -9.748907e+01 -9.748903e+01
## age         -9.607663e-04 -9.607669e-04
## weight      -6.292823e-01 -6.292820e-01
## height       3.974886e-01  3.974884e-01
## bmi          1.785331e+00  1.785330e+00
## neck        -4.945725e-01 -4.945725e-01
## abdomen      8.945189e-01  8.945189e-01
## hip         -1.255549e-01 -1.255549e-01

Example 2: Binary outcome (logistic regression)

Aim is to predict if a person has diabetes. The data stem from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman:

\(~\)

  • diabetes: 0= not present, 1= present
  • npreg: number of pregnancies
  • glu: plasma glucose concentration in an oral glucose tolerance test
  • bp: diastolic blood pressure (mmHg)
  • skin: triceps skin fold thickness (mm)
  • bmi: body mass index (weight in kg/(height in m)\(^2\))
  • ped: diabetes pedigree function.
  • age: age in years

The statistical model: Logistic regression

\(~\)

  • \(i=1,\ldots, n\) observations in the training set. We will use \(r\) (instead of \(p\)) for the number of covariates, to avoid confusion with the probability \(p\).

\(~\)

  • The binary response \(Y_i \in \{1, 0\}\) with \[\text{P}(Y_i =1 ) = p_i\] and \[\log \Big ( \frac{p_i}{1-p_i}\Big ) = \beta^\top {\boldsymbol x}_i \ ,\] where \(\log(\frac{x}{1-x})\) is the logit function.

  • Therefore \[ p_i = \frac{\exp(\beta^\top {\boldsymbol x}_i)}{1+\exp(\beta^\top {\boldsymbol x}_i)} = \frac{1}{1+\exp(-\beta^\top {\boldsymbol x}_i)} \ . \]

\(~\)

  • This function is S-shaped, and ranges between 0 and 1 (so the \(p_i\) is between 0 and 1).


Parameter estimation in the statistical model

(Maximum likelihood)

  • Given \(n\) independent pairs of covariates and responses \(\{x_i, y_i\}\), the log-likelihood function of a logistic regression model can be written as:

\[ \ln(L(\boldsymbol{\beta}))=l(\boldsymbol{\beta}) =\sum_{i=1}^n \Big ( y_i \ln p_i + (1-y_i) \ln(1 - p_i )\Big ) \ .\]

  • Maximization: set the \(r+1\) partial derivatives (which form the gradient) to 0.
  • There is no closed-form solution, so we use a gradient-based method (Newton-Raphson or Fisher scoring) to find \(\hat\beta\) and \(\text{SD}(\hat\beta)\): \[ \beta^{(t+1)}=\beta^{(t)} + {\boldsymbol F}({\boldsymbol \beta}^{(t)})^{-1} s(\beta^{(t)}) \ , \] where the gradient of the log-likelihood, \({\boldsymbol s}({\boldsymbol \beta})=\frac{\partial l}{\partial \boldsymbol \beta}\), is called the score vector, and the new quantity \({\boldsymbol F}({\boldsymbol \beta}^{(t)})^{-1}\) is called the inverse expected Fisher information matrix.
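
The scoring iteration can be sketched as follows (an illustrative Python implementation with simulated data; note that for the logistic model the observed and expected information coincide, so Newton-Raphson and Fisher scoring give the same updates):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# intercept column plus two covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(3)
for t in range(25):
    p = 1 / (1 + np.exp(-X @ beta))         # current fitted probabilities p_i
    s = X.T @ (y - p)                       # score vector s(beta)
    F = X.T @ ((p * (1 - p))[:, None] * X)  # expected Fisher information F(beta)
    beta = beta + np.linalg.solve(F, s)     # scoring update beta^(t+1)
```

The same matrix also gives \(\text{SD}(\hat\beta)\): the square roots of the diagonal of \({\boldsymbol F}(\hat{\boldsymbol\beta})^{-1}\).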

The neural network model: architecture and activation function

\(~\)

  • Remember: in the neural network (NN) version of linear regression, we had: \[ y_1({\boldsymbol x}_i)=w_0+w_1 x_{i1}+\cdots + w_r x_{ir} \ , \] with activation function \(\phi_o(x)=x\).

\(~\)

  • In the NN version of logistic regression we instead have the sigmoid activation function \(\phi_o(x)=\frac{1}{1+\exp(-x)}\), often denoted as \(\sigma(x)\). Again, we prefer to use \(\hat{y}_1({\boldsymbol x}_i)\) and get:

\[ \hat{y}_1({\boldsymbol x}_i)= \sigma({\boldsymbol x}_i) = \frac{1}{1+\exp(-(w_0+w_1 x_{i1}+\cdots + w_r x_{ir}))} \in (0,1) \ . \]


Neural networks: loss function and gradient descent

  • For NNs we use binomial cross-entropy loss \[ J({\boldsymbol w})=-\frac{1}{n}\sum_{i=1}^n \left(y_i\ln({\hat{y}_1({\boldsymbol x}_i)})+(1-y_i)\ln(1-{\hat{y}_1({\boldsymbol x}_i)})\right) \ , \] which is a scaled version of the negative of the binomial loglikelihood!

\(~\)

  • Optimization is also done with gradient descent, but we need the chain rule (due to the activation function) to get the partial derivatives for the gradient direction.

\(~\)

  • We need the backpropagation algorithm, using the activation and loss functions given here.
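
For the sigmoid output with cross-entropy loss, the chain rule collapses to the simple gradient \(\nabla J({\boldsymbol w})=\frac{1}{n}\sum_{i=1}^n (\hat{y}_1({\boldsymbol x}_i)-y_i){\boldsymbol x}_i\). A minimal gradient-descent sketch (illustrative Python with simulated data, not the Pima data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([0.0, 1.5, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))

w = np.zeros(3)
lam = 0.5  # learning rate
for t in range(5000):
    y_hat = 1 / (1 + np.exp(-X @ w))  # sigmoid activation phi_o
    grad = X.T @ (y_hat - y) / n      # chain rule: average of (y_hat - y) * x
    w = w - lam * grad                # gradient descent step
```

Plain gradient descent needs many more iterations than Fisher scoring, but each step is cheap; at convergence the weights approach the maximum-likelihood estimate \(\hat{\boldsymbol\beta}\) of the corresponding logistic regression.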

Parameter estimation vs. network weights

\(~\)

fitlogist = glm(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
    data = train, family = binomial(link = "logit"))
summary(fitlogist)
## 
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped + 
##     age, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9830  -0.6773  -0.3681   0.6439   2.3154  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.773062   1.770386  -5.520 3.38e-08 ***
## npreg        0.103183   0.064694   1.595  0.11073    
## glu          0.032117   0.006787   4.732 2.22e-06 ***
## bp          -0.004768   0.018541  -0.257  0.79707    
## skin        -0.001917   0.022500  -0.085  0.93211    
## bmi          0.083624   0.042827   1.953  0.05087 .  
## ped          1.820410   0.665514   2.735  0.00623 ** 
## age          0.041184   0.022091   1.864  0.06228 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 256.41  on 199  degrees of freedom
## Residual deviance: 178.39  on 192  degrees of freedom
## AIC: 194.39
## 
## Number of Fisher Scoring iterations: 5

set.seed(787879)
library(nnet)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
    data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 1000,
    entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 0.1))
## # weights:  8
## initial  value 213.575955 
## iter  10 value 89.511044
## final  value 89.195333 
## converged
# entropy=TRUE because default is least squares
cbind(fitnnet$wts, fitlogist$coefficients)
##                     [,1]         [,2]
## (Intercept) -9.773046277 -9.773061533
## npreg        0.103183171  0.103183427
## glu          0.032116832  0.032116823
## bp          -0.004767678 -0.004767542
## skin        -0.001917105 -0.001916632
## bmi          0.083624151  0.083623912
## ped          1.820397792  1.820410367
## age          0.041183744  0.041183529

\(~\)

By setting entropy=TRUE we minimize the cross-entropy loss.


plotnet(fitnnet)


But, there may also exist local minima.

\(~\)

set.seed(123)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
    data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 10000,
    entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 1))
## # weights:  8
## initial  value 24315.298582 
## final  value 12526.062906 
## converged
cbind(fitnnet$wts, fitlogist$coefficients)
##                     [,1]         [,2]
## (Intercept)   -36.733537 -9.773061533
## npreg         -77.126994  0.103183427
## glu         -2984.409175  0.032116823
## bp          -1835.934259 -0.004767542
## skin         -718.072629 -0.001916632
## bmi          -818.561311  0.083623912
## ped            -8.687473  1.820410367
## age          -773.023878  0.041183529

Why can NN and logistic regression lead to such different results?


Example 3: Categorical outcome (multiclass regression)

\(~\)

The iris flower data set was introduced by the British statistician and biologist Ronald Fisher in 1936.

\(~\)

  • Three plant species: {setosa, virginica, versicolor}.
  • Four features: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

\(~\)

The aim is to predict the species of an iris plant.


The statistical model

\(~\)

We only briefly mentioned multiclass regression in module 4.

  • Assume we have independent observation pairs \(({\boldsymbol x}_i, Y_i)\), where the covariate vector \({\boldsymbol x}_i\) consists of the same measurements for each response category.

\(~\)

  • Each observation can only belong to one response class, \(Y_i \in \{1, \ldots ,C\}\).

\(~\)

  • Dummy variable coding of the response in a \(C\)-dimensional vector: \({\boldsymbol y}_i=(0,0,\ldots,0,1,0,\ldots,0)\) with a value of \(1\) in the \(c^{th}\) element of \({\boldsymbol y}_i\) if the class is \(c\).

  • Probabilities that the response is category \(c\) for subject \(i\) \[p_{ic}=P(Y_i=c) \ ,\] where \(\sum_{c=1}^C p_{ic}=1\). In statistics we do not model \(p_{i1}\), because \(p_{i1}=1-\sum_{c=2}^{C} p_{ic}\), and \(\beta_1=0\).

\(~\)

  • Generalization of the logistic regression model: \[p_{ic}=P(Y_i=c)= \frac{\exp({\boldsymbol x}_i^T{\boldsymbol \beta}_c)}{1+\sum_{s=2}^{C}\exp({\boldsymbol x}_i^T{\boldsymbol \beta}_s)}\]

\(~\)

  • Classification to the class with the highest probability, \(\underset{c}{\text{argmax}}(p_{ic})\).

Parameter estimation in the statistical model

\(~\)

  • The log-likelihood of the multinomial regression model can be written as \[ \ln(L({\boldsymbol \beta}))\propto \sum_{i=1}^n \sum_{c=1}^C y_{ic}\ln(p_{ic}) \ ,\] where \(p_{iC}=1-p_{i1}-p_{i2}-\cdots -p_{i,C-1}\), and the regression parameters enter via the \(p_{ic}\)s.

\(~\)

  • Parameter estimation is done in the same way as for the logistic regression, with the Fisher scoring algorithm.

\(~\)

  • However (and this might be confusing), an efficient function in R also relies on neural networks for optimization (see below).

Neural network architecture and activation function

  • Builds an output layer with \(C\) nodes and corresponding 0/1 targets (responses) using the dummy variable coding of the responses, called one-hot coding.
  • The activation function for the output layer is called softmax. For each class \(c=1,\ldots, C\) it is given as \[ \hat{y}_c({\boldsymbol x}_i) = \frac{\exp({\boldsymbol x}_i^T{\boldsymbol w}_c)}{\sum_{s=1}^{C}\exp({\boldsymbol x}_i^T{\boldsymbol w}_s)} \ , \] where each \({\boldsymbol w}_s\) is an \((r+1)\)-dimensional vector of weights.
  • Note: there is some redundancy here, since \(\sum_{c=1}^C {\hat y}_{c}({\boldsymbol x}_i)=1\), so we could have had \(C-1\) output nodes, but this is not done.
  • The focus of neural networks is not to interpret the weights, and there is no need to assume full rank of a matrix with output nodes.
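
The softmax activation and the one-hot coded targets can be sketched as follows (illustrative Python; the small matrices are made up for the example, not the iris data):

```python
import numpy as np

def softmax(scores):
    # subtract the row maximum for numerical stability; each row sums to 1
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# n = 4 observations, r = 2 covariates (plus bias), C = 3 classes
X = np.array([[1.0, 0.5, -1.2],
              [1.0, 2.0, 0.3],
              [1.0, -0.7, 1.1],
              [1.0, 0.1, 0.0]])
W = np.zeros((3, 3))         # one (r+1)-vector of weights w_s per class
Y = np.eye(3)[[0, 2, 1, 0]]  # one-hot coded responses y_ic

P = softmax(X @ W.T)         # predicted probabilities y_hat_c(x_i)
# categorical cross-entropy (here without the extra 1/C scaling used on the slides)
loss = -np.mean(np.sum(Y * np.log(P), axis=1))
```

With all weights zero, every class gets probability \(1/3\) and the loss is \(\ln 3 \approx 1.10\); training would now adjust `W` by gradient descent.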

Neural networks: loss function and gradient descent

\(~\)

  • For neural networks, the negative of the multinomial loglikelihood is a scaled version of the categorical cross-entropy loss \[ J({\boldsymbol w})=-\frac{1}{n}\sum_{i=1}^n\frac{1}{C} \sum_{c=1}^C y_{ic}\ln({\hat{y}_c({\boldsymbol x}_i)}) \ .\]

\(~\)

  • The optimization is done using gradient descent, with minor changes from what was done for the logistic regression due to the added sum and the change in the activation function.

Fitting multinomial regression vs a neural network

\(~\)

First select a training sample

\(~\)

library(nnet)
set.seed(123)
train = sample(1:150, 50)
iris_train = ird[train, ]
iris_test = ird[-train, ]

\(~\)

Then fit the nnet() (by default using the softmax activation function)

set.seed(1234)
iris.nnet <- nnet(species ~ ., data = ird, subset = train, size = 0,
    skip = TRUE, maxit = 100)
## # weights:  15
## initial  value 105.595764 
## iter  10 value 1.050064
## iter  20 value 0.018814
## iter  30 value 0.003937
## iter  40 value 0.002062
## iter  50 value 0.001460
## iter  60 value 0.000150
## iter  70 value 0.000125
## iter  80 value 0.000110
## final  value 0.000096 
## converged

  • How many weights (parameters) have been estimated?

  • What does the graph look like?

\(~\)

summary(iris.nnet)
## a 4-0-3 network with 15 weights
## options were - skip-layer connections  softmax modelling 
##  b->o1 i1->o1 i2->o1 i3->o1 i4->o1 
##  36.79   9.69   1.26 -14.33 -13.24 
##  b->o2 i1->o2 i2->o2 i3->o2 i4->o2 
##   5.16  10.10  21.33 -25.47 -12.48 
##  b->o3 i1->o3 i2->o3 i3->o3 i4->o3 
## -42.03 -20.25 -23.12  40.80  25.96


Multinomial regression is also done with nnet, but using a wrapper multinom (we use default settings, so results are not necessarily the same as above).

\(~\)

library(caret)
fit = multinom(species ~ ., family = multinomial, data = iris_train,
    trace = FALSE)
coef(fit)
##   (Intercept)   Sepal.L.  Sepal.W.  Petal.L.  Petal.W.
## s   -15.67359  -8.679545  42.40492 -25.94868 -7.132379
## v  -157.32804 -19.694488 -17.03445  46.05297 59.135341
fit$wts
##  [1]    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
##  [7]    0.000000  -15.673595   -8.679545   42.404924  -25.948681   -7.132379
## [13]    0.000000 -157.328044  -19.694488  -17.034447   46.052968   59.135341

Summing up

\(~\)

  1. Multiple linear regression
    • NN with one input layer and one node in the output layer,
    • linear activation function,
    • mean squared loss.

\(~\)

  2. Logistic regression (2 classes)
    • NN with one input layer and one node in the output layer,
    • sigmoid activation function,
    • and binary cross-entropy loss.

\(~\)

  3. Multinomial regression (\(C>2\) classes)
    • NN with one input layer and \(C\) nodes in the output layer,
    • softmax activation function,
    • categorical cross-entropy loss.

\(~\)

But:

  • These are only linear models (linear boundaries).

  • Parameters (weights) are found using gradient descent algorithms, where the learning rate (step length) must be set.


Feedforward networks

  • Connections go only forward in the network; there are no feedback connections that send the output of the model back into the network.

  • Examples: Linear, logistic and multinomial regression with or without any hidden layers (between the input and output layers).

  • We may have between zero and very many hidden layers.

  • Adding hidden layers with non-linear activation functions between the input and output layer will make nonlinear statistical models.

  • The number of hidden layers is called the depth of the network, and the number of nodes in a layer is called the width of the layer.




The single hidden layer feedforward network

\(~\)

The nodes are also called neurons.

\(~\)

Notation

\(~\)

  1. Inputs (input layer nodes), \(j=1,\ldots, p\): \(x_1, x_2, \ldots, x_p\), or as a vector \({\boldsymbol x}\).
  2. The nodes \(z_m\) in the hidden layer, \(m=1,\ldots, M\); as vector \({\boldsymbol z}^\top=(z_1, \ldots, z_M)\), and the hidden layer activation function \(\phi_h\). \[ z_m({\boldsymbol x})=\phi_h(\alpha_{0m}+\sum_{j=1}^p \alpha_{jm}x_{j}) \] where \(\alpha_{jm}\) is the weight from input \(j\) to hidden node \(m\), and \(\alpha_{0m}\) is the bias term for the \(m\)th hidden node.

  3. The node(s) in the output layer, \(c=1,\ldots C\): \(y_1, y_2, \ldots, y_C\), or as vector \({\boldsymbol y}\), and output layer activation function \(\phi_o\). \[ \hat{y}_c({\boldsymbol x})=\phi_o(\beta_{0c}+\sum_{m=1}^M \beta_{mc}z_{m}({\boldsymbol x})) \] where \(\beta_{mc}\) is the weight from hidden node \(m\) to output node \(c\), and \(\beta_{0c}\) is the bias term for the \(c\)th output node.

\(~\)

  4. Taken together: \[ \hat{y}_c({\boldsymbol x})=\phi_o(\beta_{0c}+\sum_{m=1}^M \beta_{mc}z_{m})=\phi_o(\beta_{0c}+\sum_{m=1}^M \beta_{mc}\phi_h(\alpha_{0m}+\sum_{j=1}^p \alpha_{jm}x_{j})) \]
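
The composed formula can be read as a forward pass through the network. A sketch (illustrative Python; the shapes and names are chosen for the example, not taken from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(x, alpha0, alpha, beta0, beta, phi_h=sigmoid, phi_o=lambda a: a):
    """Single-hidden-layer forward pass.
    x: (p,) inputs; alpha0: (M,) hidden biases; alpha: (p, M) input-to-hidden weights;
    beta0: (C,) output biases; beta: (M, C) hidden-to-output weights."""
    z = phi_h(alpha0 + x @ alpha)   # hidden nodes z_m(x)
    return phi_o(beta0 + z @ beta)  # outputs y_hat_c(x)

p, M, C = 4, 5, 3
rng = np.random.default_rng(0)
y_hat = forward(rng.normal(size=p),
                rng.normal(size=M), rng.normal(size=(p, M)),
                rng.normal(size=C), rng.normal(size=(M, C)))
```

Counting biases, this network has \(M(p+1) + C(M+1) = 5\cdot 5 + 3\cdot 6 = 43\) parameters.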

Hands on:

\(~\)

  • Identify \(p, M, C\) in the network figure above, and relate that to the \(y_{c}({\boldsymbol x})\) equation.

\(~\)

  • How many parameters need to be estimated for this network?

\(~\)

  • What determines the values of \(p\) and \(C\)?

\(~\)

  • How is \(M\) determined?

Special case: linear activation function for the hidden layer

\(~\)

If we assume that \(\phi_h(z)=z\) (linear or identity activation):

\[ \hat{y}_c({\boldsymbol x})=\phi_o(\beta_{0c}+\sum_{m=1}^M \beta_{mc}(\alpha_{0m}+\sum_{j=1}^p \alpha_{jm}x_{j})) \]

\(~\) \(~\)

Q: Does this look like something you have seen before?

\(~\)

A:


Universal approximation property

\(~\)

  • Think of the goal of a feedforward network as approximating some function \(f\) that maps our input vector \({\boldsymbol x}\) to an output value \({\boldsymbol y}\).

\(~\)

  • What type of mathematical function can a feedforward neural network with one hidden layer and linear output activation represent?

\(~\)

The universal approximation theorem says that a feedforward network with

  • a linear output layer
  • at least one hidden layer with a “squashing” activation function and “enough” hidden units

can approximate any (Borel measurable) function from one finite-dimensional space (our input layer) to another (our output layer) with any desired non-zero amount of error.


In particular, the universal approximation theorem holds for

  • The sigmoid \(\phi(a)=1/(1+\exp(-a))\) (logistic) activation functions.
  • The rectified linear unit (ReLU) \(\phi_h(a)=\max(0,a)\) activation functions

in the hidden layer.
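
Both activation functions are one-liners; a minimal sketch (Python, for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))  # "squashing": output always in (0, 1)

def relu(a):
    return np.maximum(0.0, a)    # rectified linear unit: max(0, a)

a = np.linspace(-5.0, 5.0, 11)
s, r = sigmoid(a), relu(a)
```

Note that ReLU is not a squashing function (it is unbounded above), yet the universal approximation property still holds for it.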


  • The ReLU activation function has replaced the sigmoid function as the default in the hidden layer(s) of a feedforward network.
  • Even though a large feedforward network with one hidden layer may be able to represent a desired function, we may not be able to learn (i.e., estimate the) parameters of the function. Why?
    • wrong number of nodes in the hidden layer.
    • our optimization routine may fail / may find a local minimum.
    • we may overfit / underfit the training data.
  • Alternative: networks with more than one hidden layer, with fewer nodes in total. A network with many hidden layers is called a deep network.

The nnet and keras R packages

\(~\)

  • We will use both the rather simple nnet R package by Brian Ripley and the currently very popular keras package for deep learning (the keras package will be presented later).

  • nnet fits one hidden layer with sigmoid activation function. The implementation does not use gradient descent, but instead BFGS via optim.

  • Type ?nnet into your R console to see the arguments of nnet().

  • If the response in formula is a factor, an appropriate classification network is constructed; this has one output, sigmoid activation and binary entropy loss for a binary response, and a number of outputs equal to the number of classes, softmax activation and categorical cross-entropy loss for more levels.


An example

Boston house prices

Objective: To predict the median price of owner-occupied homes in a given Boston suburb in the mid-1970s using 10 input variables.

This data set is both available in the MASS and keras R package.

\(~\)

Preparing the data

  • Only 506 samples, split between 404 training samples and 102 test samples (this split is already done in the keras library).
  • Each feature in the input data (for example, the crime rate) has a different scale: some values are proportions between 0 and 1, others take values between 1 and 12, others between 0 and 100, and so on.


References and further reading


Chollet, François, and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference - Algorithms, Evidence, and Data Science. Cambridge University Press.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.